Kernel method for clustering based on optimal target vector 
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We introduce the notion of optimal target vector, and describe how it creates a link between super- 
vised and unsupervised learning. We exploit this notion to construct Ising models, for dichotomic 
clustering, whose couplings are (i) both ferro- and anti-ferromagnetic (ii) depending on the whole 
data-set and not only on pairs of samples. The effectiveness of the method is shown in the case of 
the well known iris data-set and in benchmarks of gene expression levels. 

PACS numbers: 05.10.-a,87.10.+e 

Recent years have been characterized by a dramatic evolution in many fields of fife science with the apparition and 
rapid spread of so-caUed high-throughout technologies, which generate huge amounts of data to handle various aspects 
of biological samples or phenomena. The need for efficient methods to represent, analyze and finally make sense out 
of these data triggered the development of numerous data analysis algorithms (some of which physically motivated 
P, Q ) • Among them, kernel methods have quickly gained popularity for problems involving the classification and 
analysis of high-dimensional or complex data. Well known supervised kernel algorithms are Support Vector Machines 
0, both for regression and classification, and Kernel Ridge Regression A popular unsupervised kernel method 
is Kernel Principal Components 0, a nonlinear projection tool which generalizes the classical principal component 
analysis. The purpose of the present work is to propose a simple connection between supervised and unsupervised 
kernel methods, which here will permit us to introduce Ising models for dichotomic clustering. 

We recall kernel ridge regression, while referring the reader to Q or |^ for further technical details. Let us consider 
a set of £ independent, identically distributed data S = {(xi,yi)}l^^, where is the n-dimensional vector of input 
variables and yi is the scalar output variable. Data are drawn from an unknown probability distribution p(x, y). The 
problem of learning consists in providing an estimator y — /(x) out of a class of functions, called hypothesis space. 
In kernel ridge regression, calling y = {yi,y2, the vector formed by the i values of the output variable, the 

estimator is given by: 

e 

2/ = /(x) = ^Qfc(x,;,x), (1) 

i=l 

where coefficients {ci} are given by 

c=(K + AI)-V, (2) 

K being the £ x £ matrix with elements Kij = fc(xi,Xj). A is called regularization parameter, and its main role is 
to render well posed the inversion of matrix K which in most cases is nearly singular. fc(-, •) is a positive definite 
symmetric function, and equation Q may be seen to correspond to a linear estimator in the feature space 

$(x) = (yorVilx), Va2V'2(x), y/a^TpNi^), ■■■), 

where ai and ipi are the eigenvalues and eigenfunctions of the integral operator with kernel k. 

Many choices of the kernel function are possible, for example the linear kernel A:(x,x') = x • x' leads to the usual 
linear estimator. The polynomial kernel of degree p has the form /c(x,x') = (1 + x • x')^ (the corresponding features 
are made of all the powers of x up to the p-th). The Gaussian kernel is fc(x, x') = exp— (||x — x'|p/2a^) and deals 
with all the degrees of nonlinearity of x. Specifying the kernel function k one determines the complexity of the 
function space within which one searches for the estimator, similarly to the effect of specifying the architecture of a 
neural network, that is number of layers, number of units for each layer, etc. Due to and (0), the predicted output 
vector y, in correspondence of the true target vector y, is given by y= Gy, where the symmetric matrix G is given 
by 

G = K(K + AI)"\ (3) 
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Note that matrix G depends only on the distribution of {x} values: G embodies information about the structures 
present in {x} data set. Indeed, the matrix element Gy quantifies how much the target value of the j — th point 
influences the estimate of the target of point i. The training error E can be calculated as follows 

(y-Gy)T(y-Gy) =y^Hy, 

where H = I — 2G + GG is a symmetric and positive matrix. 

It is worth stressing that, given a kernel function, the corresponding features 0-y(x) are not centered in general. One 
can show Q that centering the features {(j)^ — > (/)^ — {(j)-y), for all 7) amounts to perform the following transformation 
on the kernel matrix: 

K ^ K = K IfK Klf + IpKIe, 

where {Ie)ij = 1/^, and to work with the centered kernel K. Therefore in the following we will always assume that 
the kernel matrix K has been centered. 

Matrix G may be seen as a linear convolution filter, mapping target vectors onto the vector of predicted outputs, 
and carrying information about the structures naturally present in the set of {x} points. In the unsupervised case 
the data set is made of x points, and the target vector y is missing. However we may consider the following 

question: what is the vector y such that treating it as the target vector leads to the best fit, i.e. the minimum training 
error y^Hy? One may expect that this optimal target vector would bring information about the structures present 
in the data. We look for the optimal target vector in the space of binary functions, ct G { — 1, 1}^, the minimizer of the 
training error (j^Ha thus naturally provides a partition of points in two classes. The minimizer is the ground state 
of an Ising model (see, e.g., 0) with long range symmetric couplings J given by 

e 

Jij — 4Gij — 2 ^ ^ GikGkj, 

k=l 

for i ^ j, and Ju — 0. We note that, unlike the Potts model introduced in 0, here the couplings are both ferro- and 
anti-ferromagnetic. Therefore there is room for frustration and multiple minima. Many algorithms can be used to 
find an estimate of the ground state of Ising models, for example simulated annealing 8]. Here we use the mean field 
annealing method 9] , which iteratively solves the mean field equations for the local magnetization vector m (a) , 

m = tanh (/3Jm) , (4) 

while decreasing the temperature (increasing P). The starting temperature Per may be chosen as the inverse of the 
maximum eigenvalue of matrix J. Note that m g { — 1, 1}^ in the limit of large /3. 

Let us now discuss the application to iris data-set, published by Fisher [l^, where the sepal length, sepal width, 
petal length, and petal width are measured in millimeters on fifty iris specimens from each of three species, iris setosa, 
iris versicolor, and iris virginica. Firstly we process all the 150 points, using a linear kernel and A = 1: we find two 
clusters, one is made of 51 points (50 belonging to iris setosa), the other is made of the remaining 99 points. Then, 
we process the 99 points with linear kernel and the same value of A, obtaining two clusters of 48 points (47 belonging 
to iris versicolor) and 51 points (49 belonging to iris virginica). Globally four points are misclassified, with efficiency 
of classification 0.973. In figure 1 we show the four misclassified points. Results are quite insensitive to A: the same 
classification is obtained varying A € [0.05, 100]. Using a Gaussian kernel we obtain even better efficiency: using a — 3 
only two points are misclassified, with efficiency 0.987. 

Now we consider application on gene expression data sets. Firstly the colon cancer data set of jl^j, consisting in 
40 tumor and 22 normal colon tissues samples, each described by 2000 gene expression levels; data are available on 
the Kent Ridge Bio- medical Data Set Repository We are not going to face, in this place, the important task of 
feature selection which is fundamental in the analysis of gene expression data. The following preprocessing is used to 
normalize data. First, for each gene, expression levels are rescaled so as to get unit mean over tissues. Then, for each 
tissue, expression levels are linearly transformed to have zero mean (over genes) and unit variance. Application of our 
method, using all the genes and a linear kernel (A = 1), leads to 12 misclassified points with an efficiency of 0.806. 
Then we select the 100 most discriminant genes, using a supervised step where nonparametric Wilcoxon test is used 
to asses the capability of each gene in discriminating tumor and normal tissues. After ranking genes by Wilcoxon 
test, we apply our method (still with linear kernel and A = 1) using only 100 attributes corresponding to the first 
100 genes. The output is depicted in figure 2, seven points are misclassified with efficiency 0.887. Next we consider 
the leukemia data set of consisting of samples of tissues of bone marrow samples, 47 affected by acute myeloid 
leukemia (AML) and 25 by acute lymphoblastic leukemia (ALL), 7129 attributes. Data are normalized as in the 
previous example. Using all genes, and linear kernel, leads to a poor performance (efficiency 0.569). Ranking genes 
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by Wilcoxon test (assessing the capability of each gene in discriminating AML and ALL tissues) and selecting the 
best 500 genes, leads to better performances: our method with a linear kernel misclassifies 8 points, with efficiency 
0.889, see figure 3. Our results are quite insensitive to the choice of A. 

We now summarize our method. Dichotomic clustering is described as the problem of evaluating the ground 
state of Ising models, with couplings values determined in the frame of kernel methods for supervised learning. The 
effectiveness of the proposed approach is shown here on three real examples in terms of high efficiency of classification. 
As the output depends very weakly on the regularization parameter A, only the choice of the kernel matters. In this 
paper we mostly deal with the simplest kernel, the linear one: the problem of kernel selection, common to any kernel 
method, will be considered elsewhere. 

Some remarks are in order. Firstly, it is worth stressing that some information about structures can also be recovered 
from diagonal elements Ga . In the leave-one-out scheme Q a data point i is removed from the data set and the model 
is trained using the remaining i ~ 1 points: let us denote j/i the target value thus predicted, in correspondence of Xi. 
It can be shown that the leave-one-out error yi — yi and the training error yi — yi, obtained using the whole data set, 
satisfy: 



This formula shows that the closer Ga to one, the farther the leave-one-out predicted value from those obtained using 
also point i in the training stage. If point i is in a dense region of the feature space then one may expect that removing 
this point from the data-set would not change much the estimate since it can be well "interpolated" by neighboring 
points. Therefore points in low density regions of the feature space are characterized by diagonal values Ga close to 
one, while Ga is close to zero for points Xi in dense regions. In figure 4 we depict {Ga} for the case of iris. 

Secondly, one may look for the optimal target vector y in R^; to avoid the trivial solution y = 0, one may constrain 
y to have unit norm, y^y = 1. This problem then becomes equivalent to find the the normalized eigenvector of 
H with the smallest eigenvalue. On the other hand, matrix H is a function of matrix K: hence it has the same 
eigenvectors of K while the corresponding eigenvalues are related by a monotonically decreasing function. Therefore, 
optimal target functions in coincide with the kernel principal components: this shows that the method introduced 
in 1^ may be motivated also as the search for the optimal target functions. 

Finally, as already shown in the iris example, we remark that multiclass clustering may be obtained by repeated 
application of the proposed dichotomic method. The notion of optimal target vector is an interesting bridge between 
supervised and unsupervised learning: here we have considered application of this notion for dichotomic clustering. 
In our opinion it will carry to the developments of other effective algorithms for the analysis of complex data. 
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FIG. 1: In the plane of the first and second principal components, the iris data-set is represented. Symbols represent classes: • 
setosa, + versicolor, x virginica. The four misclassified points are surrounded by a circle. 





X 

+ 

+ 


sy X 

+ ^ X 
X +x ^ 








>^ ^ X 
X X X X 

©X 




- + 


+ 
+ 


+ 


X 

X 

X 


+ 


+ 




© ® - 


+ 

+ 


+ 


+ 





1.2' • • • ' • • 1 

-1.5 -1 -0.5 0.5 1 1.5 2 



FIG. 2: The colon data set, + normal and x tumor, is represented in the plane of the first and second principal components 
(extracted over the 100 most discriminating genes). Misclassified points are surrounded by a circle 
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FIG. 3: The leukemia data set, + AML and x ALL, is represented in the plane of the first and second principal components 
(extracted over the 500 most discriminating genes). Misclassified points are surrounded by a circle. 



1.5 



0.5 



-0.5 



-1.5 




Oj-, o °o. 



FIG. 4: The 150 points of iris data-set are depicted (first and second principal components). The radius of the circle around 
each point is proportional to 1 — Gu, the matrix G being evaluated using a Gaussian kernel with a = 0.5 and A = 1. 



